 Kazan


CNSocialDepress: A Chinese Social Media Dataset for Depression Risk Detection and Structured Analysis

Xu, Jinyuan, Lan, Tian, Yu, Xintao, He, Xue, Zhang, Hezhi, Wang, Ying, Magistry, Pierre, Valette, Mathieu, Li, Lei

arXiv.org Artificial Intelligence

Depression is a pressing global public health issue, yet publicly available Chinese-language resources for risk detection remain scarce and are mostly limited to binary classification. To address this limitation, we release CNSocialDepress, a benchmark dataset for depression risk detection from Chinese social media posts. The dataset contains 44,178 texts from 233 users, within which psychological experts annotated 10,306 depression-related segments. CNSocialDepress provides binary risk labels together with structured multi-dimensional psychological attributes, enabling interpretable and fine-grained analysis of depressive signals. Experimental results demonstrate its utility across a wide range of NLP tasks, including structured psychological profiling and fine-tuning of large language models for depression detection. Comprehensive evaluations highlight the dataset's effectiveness and practical value for depression risk identification and psychological analysis, thereby providing insights to mental health applications tailored for Chinese-speaking populations.


A Real-Time Framework for Intermediate Map Construction and Kinematically Feasible Off-Road Planning Without OSM

Jerome, Otobong, Kulathunga, Geesara Prathap, Dmitry, Devitt, Murawjow, Eugene, Klimchik, Alexandr

arXiv.org Artificial Intelligence

Off-road environments present unique challenges for autonomous navigation due to their complex and unstructured nature. Traditional global path-planning methods, which typically aim to minimize path length and travel time, perform poorly on large-scale maps and fail to account for critical factors such as real-time performance, kinematic feasibility, and memory efficiency. This paper introduces a novel global path-planning method specifically designed for off-road environments, addressing these essential factors. The method begins by constructing an intermediate map within the pixel coordinate system, incorporating geographical features like off-road trails, waterways, restricted and passable areas, and trees. The planning problem is then divided into three sub-problems: graph-based path planning, kinematic feasibility checking, and path smoothing. This approach effectively meets real-time performance requirements while ensuring kinematic feasibility and efficient memory use. The method was tested in various off-road environments with large-scale maps up to several square kilometers in size, successfully identifying feasible paths in an average of 1.5 seconds and utilizing approximately 1.5GB of memory under extreme conditions. The proposed framework is versatile and applicable to a wide range of off-road autonomous navigation tasks, including search and rescue missions and agricultural operations.
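The three sub-problems named in this abstract (graph-based planning, kinematic feasibility checking, path smoothing) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' method: the grid stands in for the intermediate pixel map, A* on a 4-connected grid stands in for the graph planner, iterative neighbour averaging stands in for smoothing, and a maximum heading change per waypoint stands in for the kinematic check. The names `astar`, `smooth`, `max_turn`, and `is_feasible` are hypothetical.

```python
import heapq
import math

def astar(grid, start, goal):
    """A* on a 4-connected grid. Cells equal to 1 are blocked.
    Returns a list of (row, col) cells, or None if no path exists."""
    rows, cols = len(grid), len(grid[0])

    def heuristic(cell):
        # Manhattan distance: admissible for unit-cost 4-connected moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(heuristic(start), 0, start)]
    came_from = {start: None}
    cost = {start: 0}
    while open_heap:
        _, g, cur = heapq.heappop(open_heap)
        if cur == goal:
            path = []
            while cur is not None:
                path.append(cur)
                cur = came_from[cur]
            return path[::-1]
        if g > cost[cur]:
            continue  # stale heap entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cur[0] + dr, cur[1] + dc)
            if not (0 <= nxt[0] < rows and 0 <= nxt[1] < cols):
                continue
            if grid[nxt[0]][nxt[1]] == 1:
                continue
            new_g = g + 1
            if new_g < cost.get(nxt, math.inf):
                cost[nxt] = new_g
                came_from[nxt] = cur
                heapq.heappush(open_heap, (new_g + heuristic(nxt), new_g, nxt))
    return None

def smooth(path, iterations=10, weight=0.5):
    """Pull each interior waypoint toward the midpoint of its neighbours.
    Endpoints stay fixed. Collisions are NOT re-checked in this sketch."""
    pts = [(float(r), float(c)) for r, c in path]
    for _ in range(iterations):
        for i in range(1, len(pts) - 1):
            mid_r = (pts[i - 1][0] + pts[i + 1][0]) / 2
            mid_c = (pts[i - 1][1] + pts[i + 1][1]) / 2
            pts[i] = (pts[i][0] + weight * (mid_r - pts[i][0]),
                      pts[i][1] + weight * (mid_c - pts[i][1]))
    return pts

def max_turn(points):
    """Largest heading change (radians) between consecutive segments."""
    worst = 0.0
    for a, b, c in zip(points, points[1:], points[2:]):
        h1 = math.atan2(b[0] - a[0], b[1] - a[1])
        h2 = math.atan2(c[0] - b[0], c[1] - b[1])
        diff = abs(h2 - h1)
        worst = max(worst, min(diff, 2 * math.pi - diff))
    return worst

def is_feasible(points, max_turn_rad):
    """Crude kinematic check: every turn must stay within the limit."""
    return max_turn(points) <= max_turn_rad

# Toy map: row 1 is blocked except for its last cell.
grid = [[0, 0, 0, 0],
        [1, 1, 1, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0]]
path = astar(grid, (0, 0), (3, 3))
smoothed = smooth(path)
```

A real planner would re-validate the smoothed path against the map and use a vehicle model (for example a minimum turning radius) rather than a per-waypoint angle limit.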


Human-Annotated NER Dataset for the Kyrgyz Language

Turatali, Timur, Alekseev, Anton, Jumalieva, Gulira, Kabaeva, Gulnara, Nikolenko, Sergey

arXiv.org Artificial Intelligence

We introduce KyrgyzNER, the first manually annotated named entity recognition dataset for the Kyrgyz language. Comprising 1,499 news articles from the 24.KG news portal, the dataset contains 10,900 sentences and 39,075 entity mentions across 27 named entity classes. We describe our annotation scheme, discuss the challenges encountered in the annotation process, and present the descriptive statistics. We also evaluate several named entity recognition models, including traditional sequence labeling approaches based on conditional random fields and state-of-the-art multilingual transformer-based models fine-tuned on our dataset. While all models show difficulties with rare entity categories, models such as the multilingual RoBERTa variant pretrained on a large corpus across many languages achieve a promising balance between precision and recall. These findings emphasize both the challenges and opportunities of using multilingual pretrained models for processing languages with limited resources. Although the multilingual RoBERTa model performed best, other multilingual models yielded comparable results. This suggests that future work exploring more granular annotation schemes may offer deeper insights for evaluating Kyrgyz language processing pipelines.
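The balance between precision and recall mentioned above is commonly measured at the entity level. The sketch below assumes exact-match scoring, where a predicted entity counts only if its span and label both match a gold entity; the abstract does not say which scoring protocol the authors used. The function name `entity_prf`, the `(start, end, label)` tuple layout, and the labels are hypothetical.

```python
def entity_prf(gold, predicted):
    """Exact-match entity-level precision, recall, and F1.
    Each entity is a (start, end, label) tuple."""
    gold_set, pred_set = set(gold), set(predicted)
    true_pos = len(gold_set & pred_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical annotations for one sentence.
gold = [(0, 2, "PERSON"), (5, 6, "LOCATION"), (9, 10, "ORG")]
pred = [(0, 2, "PERSON"), (5, 6, "ORG"), (9, 10, "ORG"), (12, 13, "PERSON")]
precision, recall, f1 = entity_prf(gold, pred)
```

Here two of four predictions are correct and two of three gold entities are found, so precision is 0.5 and recall is 2/3. Rare classes contribute few gold entities, which is one reason per-class scores for them tend to be unstable.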


The benefits of query-based KGQA systems for complex and temporal questions in LLM era

Alekseev, Artem, Chaichuk, Mikhail, Butko, Miron, Panchenko, Alexander, Tutubalina, Elena, Somov, Oleg

arXiv.org Artificial Intelligence

Large language models excel in question-answering (QA) yet still struggle with multi-hop reasoning and temporal questions. Query-based knowledge graph QA (KGQA) offers a modular alternative by generating executable queries instead of direct answers. We explore a multi-stage query-based framework for WikiData QA, proposing a multi-stage approach that enhances performance on challenging multi-hop and temporal benchmarks. Through generalization and rejection studies, we evaluate robustness across multi-hop and temporal QA datasets. Additionally, we introduce a novel entity linking and predicate matching method using CoT reasoning. Our results demonstrate the potential of the query-based multi-stage KGQA framework for improving multi-hop and temporal QA with small language models. Code and data: https://github.com/ar2max/NLDB-KGQA-System


SemIRNet: A Semantic Irony Recognition Network for Multimodal Sarcasm Detection

Zhou, Jingxuan, Wu, Yuehao, Zhang, Yibo, Zhang, Yeyubei, Liu, Yunchong, Huang, Bolin, Yuan, Chunhong

arXiv.org Artificial Intelligence

To address the difficulty of accurately identifying implicit image-text correlations in multimodal irony detection, this paper proposes a Semantic Irony Recognition Network (SemIRNet). The model contains three main innovations: (1) the ConceptNet knowledge base is introduced for the first time to acquire conceptual knowledge, which enhances the model's common-sense reasoning ability; (2) two cross-modal semantic similarity detection modules, at the word level and the sample level, are designed to model image-text correlations at different granularities; and (3) a contrastive learning loss function is introduced to optimize the spatial distribution of the sample features, which improves the separability of positive and negative samples. Experiments on a publicly available multimodal irony detection benchmark dataset show that the model improves accuracy and F1 by 1.64% and 2.88%, reaching 88.87% and 86.33%, respectively, over the best existing methods. Further ablation experiments verify the important role of knowledge fusion and semantic similarity detection in improving model performance.


MT-RAIG: Novel Benchmark and Evaluation Framework for Retrieval-Augmented Insight Generation over Multiple Tables

Seo, Kwangwook, Kwon, Donguk, Lee, Dongha

arXiv.org Artificial Intelligence

Recent advancements in table-based reasoning have expanded beyond factoid-level QA to address insight-level tasks, where systems should synthesize implicit knowledge in the table to provide explainable analyses. Although effective, existing studies remain confined to scenarios where a single gold table is given alongside the user query, failing to address cases where users seek comprehensive insights from multiple unknown tables. To bridge these gaps, we propose MT-RAIG Bench, designed to evaluate systems on Retrieval-Augmented Insight Generation over Multiple Tables. Additionally, to tackle the suboptimality of existing automatic evaluation methods in the table domain, we further introduce a fine-grained evaluation framework, MT-RAIG Eval, which achieves better alignment with human quality judgments on the generated insights. We conduct extensive experiments and reveal that even frontier LLMs still struggle with complex multi-table reasoning, establishing our MT-RAIG Bench as a challenging testbed for future research.


International underwater cable attacks by Russia, China are no 'mere coincidence' warns EU's top diplomat

FOX News

Attacks on underwater cables running through strategically significant bodies of water in both the Baltic Sea and the South China Sea by Russia and China, respectively, in recent months have top officials concerned they are not "mere coincidence." Maritime sabotage efforts in both regions of the world appear to have been on the rise over the last several years, with a notable spike in recent months after at least three separate attacks occurred in as many months, beginning in November, and the top suspects are Russia and China. "The Kremlin has been running a hybrid campaign against Europe for years, ranging from spreading disinformation and cyberattacks to weaponizing energy supplies. Since Russia's full-scale invasion of Ukraine, these efforts have intensified dramatically," EU High Representative Kaja Kallas told Fox News Digital. "However, Russia is not the only challenge we face."


Disparate Model Performance and Stability in Machine Learning Clinical Support for Diabetes and Heart Diseases

Bilionis, Ioannis, Berrios, Ricardo C., Fernandez-Luque, Luis, Castillo, Carlos

arXiv.org Artificial Intelligence

Machine Learning (ML) algorithms are vital for supporting clinical decision-making in biomedical informatics. However, their predictive performance can vary across demographic groups, often due to the underrepresentation of historically marginalized populations in training datasets. The investigation reveals widespread sex- and age-related inequities in chronic disease datasets and their derived ML models. Thus, a novel analytical framework is introduced, combining systematic arbitrariness with traditional metrics like accuracy and data complexity. The analysis of data from over 25,000 individuals with chronic diseases revealed mild sex-related disparities, favoring predictive accuracy for males, and significant age-related differences, with better accuracy for younger patients. Notably, older patients showed inconsistent predictive accuracy across seven datasets, linked to higher data complexity and lower model performance. This highlights that representativeness in training data alone does not guarantee equitable outcomes, and model arbitrariness must be addressed before deploying models in clinical settings.


Semantic Component Analysis: Discovering Patterns in Short Texts Beyond Topics

Eichin, Florian, Schuster, Carolin M., Groh, Georg, Hedderich, Michael A.

arXiv.org Artificial Intelligence

Topic modeling is a key method in text analysis, but existing approaches either assume one topic per document or fail to scale efficiently to large, noisy datasets of short texts. We introduce Semantic Component Analysis (SCA), a novel topic modeling technique that overcomes these limitations by discovering multiple, nuanced semantic components beyond a single topic in short texts. We accomplish this by introducing a decomposition step to the clustering-based topic modeling framework. We evaluate SCA on Twitter datasets in English, Hausa and Chinese. It achieves competitive coherence and diversity compared to BERTopic, while uncovering at least double the semantic components and maintaining a noise rate close to zero. Furthermore, SCA is scalable and effective across languages, including an underrepresented one.


HJ-Ky-0.1: an Evaluation Dataset for Kyrgyz Word Embeddings

Alekseev, Anton, Kabaeva, Gulnara

arXiv.org Artificial Intelligence

One of the key tasks in modern applied computational linguistics is constructing word vector representations (word embeddings), which are widely used to address natural language processing tasks such as sentiment analysis, information extraction, and more. To choose an appropriate method for generating these word embeddings, quality assessment techniques are often necessary. A standard approach involves calculating distances between vectors for words with expert-assessed 'similarity'. This work introduces the first 'silver standard' dataset for such tasks in the Kyrgyz language, alongside training corresponding models and validating the dataset's suitability through quality evaluation metrics.
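The standard approach described above can be sketched as follows: compute a similarity score for each word pair from the embeddings, then measure how well those scores agree with the expert ratings. The sketch assumes cosine similarity and Spearman rank correlation, a common choice for similarity datasets; the abstract does not state which measures the authors used. The toy vectors, word keys, and ratings are invented placeholders, and all function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def evaluate(embeddings, rated_pairs):
    """Correlate model similarities with human ratings.
    Pairs with an out-of-vocabulary word are skipped."""
    model_scores, human_scores = [], []
    for word1, word2, rating in rated_pairs:
        if word1 in embeddings and word2 in embeddings:
            model_scores.append(cosine(embeddings[word1], embeddings[word2]))
            human_scores.append(rating)
    return spearman(model_scores, human_scores)

# Placeholder data: keys stand in for Kyrgyz words, ratings for expert scores.
toy_embeddings = {"w1": [1.0, 0.0], "w2": [0.9, 0.1],
                  "w3": [0.5, 0.5], "w4": [0.0, 1.0]}
toy_pairs = [("w1", "w2", 9.0), ("w1", "w3", 5.0), ("w1", "w4", 1.0)]
score = evaluate(toy_embeddings, toy_pairs)
```

A correlation near 1 means the embedding ranks word pairs in the same order as the raters. Skipping out-of-vocabulary pairs inflates the score for models with small vocabularies, so the number of pairs actually scored is worth reporting alongside the correlation.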